Image classification

This notebook documents all the code that I am writing for my Year 11 Personal Project, which revolves around building a machine learning (ML) model to predict the type of object present in a given image.

In this directory there is a large (2GB) dataset with images, which has a total of 256 different categories. Each category corresponds to a specific object. Soon enough, this dataset will be split into training, validation, and test sets. There will also be additional testing data consisting of pictures taken by me.

The first step required is to load the dataset.

Loading the dataset

As written above, there is a large dataset in this directory. I will need to load the dataset so that it is of a format that is more suitable for data analysis (more about this part later) and ML.

An issue that I did not previously identify is that I do not know how to load images into NumPy arrays (which are a much more suitable format to work with). NumPy arrays can easily be converted into pandas data frames for the exploratory data analysis (EDA) and can be used for many other purposes.

As a result, I have done some research about this.

Finding a module to load images

It only took a few minutes before I found a useful module. The imageio module provides functionality to load an image into a np.array format, which is precisely what I needed!

I found it on this website[1], which presents tutorials using SciPy. It made use of imageio as an auxiliary module. According to its PyPI page[2], imageio is quite a popular module, with 1322 stars on GitHub at the time of writing. It is also actively maintained - its latest version was released on October 2, 2023.

In the cell below, I load one of the images from the dataset into a np.array.

[1] http://scipy-lectures.org/advanced/image_processing/

[2] https://pypi.org/project/imageio/

From the above we can see that the image 001_0001.jpg, which is an image of an AK-47 gun, has dimensions of 278 by 499 pixels. The fact that the shape is (278, 499, 3) - note the 3 at the end - confirms that all three color channels (red, green, blue) have in fact been loaded from the image.

Let's try another image.

This is problematic.

The two main problems that I can identify are:

From this I can conclude that my selected dataset is not great for ML purposes.

Finding a new dataset

After a bit of hunting around, I found the CIFAR-100 dataset.[1] This dataset contains 60000 32x32 images. The low resolution is slightly disappointing because I was hoping to be able to use this model at the exhibition and allow people to give it images, but I suppose it is a necessary evil to ensure that training does not take too long. This dataset is suitable because there are 100 categories organized into 20 higher-level "superclasses".

[1] https://www.kaggle.com/datasets/fedesoriano/cifar100/data

Let us load the dataset. Its format is different from the other one. I am using a method described on the Kaggle page. (Kaggle is a website where people can host datasets for data analysis and ML.)
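
The method boils down to unpickling the batch files. Here is a sketch of the idea; the file names on the Kaggle page (e.g. "train" and "test") are assumptions about the layout, so to keep this cell self-contained I fake a tiny batch with the same structure:

```python
import pickle
import numpy as np

def unpickle(path):
    # CIFAR batch files are pickled dicts with byte-string keys
    with open(path, "rb") as f:
        return pickle.load(f, encoding="bytes")

# Self-contained demo: a fake batch laid out like the real "train" file,
# where each row of b"data" is one flattened 32x32x3 image (3072 values)
fake = {b"data": np.zeros((4, 3072), dtype=np.uint8),
        b"fine_labels": [0, 1, 2, 3],
        b"coarse_labels": [0, 0, 1, 1]}
with open("fake_batch", "wb") as f:
    pickle.dump(fake, f)

batch = unpickle("fake_batch")
print(batch[b"data"].shape)  # (4, 3072)
```

With the real files, `unpickle("train")[b"data"]` would give the full 50000x3072 training matrix.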

Observations

It is clear that this dataset is organized in a different manner from the other dataset (the one with 256 categories).

One key thing to note is that they have split the dataset into a training set and a test set in a 5:1 ratio. My machine learning workflow also requires a validation set.

I have decided on the following distribution:

In order to ensure that all classes are represented across all sets, I need to write some code to distribute them.

Before proceeding, let us check the images to see whether they are up to our standards for machine learning training.

As I already said, the 32x32 dimensions are problematic for us.

It's time to split the dataset into training and validation sets. Further, the shape of this dataset is quite inconvenient, so I need to reshape all of the arrays.
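
The reshaping step can be sketched as follows. Each row of the data matrix holds the red plane, then the green plane, then the blue plane (1024 values each), so going from (N, 3072) to the channels-last (N, 32, 32, 3) shape takes a reshape plus a transpose (real CIFAR data is uint8; I use plain ints here so the demo values stay readable):

```python
import numpy as np

# Fake batch in CIFAR layout: row = [1024 red values, 1024 green, 1024 blue]
flat = np.arange(2 * 3072).reshape(2, 3072)

# (N, 3072) -> (N, 3, 32, 32) -> (N, 32, 32, 3) so the channels come last
images = flat.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
print(images.shape)  # (2, 32, 32, 3)
```

A quick sanity check: pixel (0, 0) of the first image should take its R, G, B values from positions 0, 1024 and 2048 of the flat row.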

Train-validation split

I need to understand how the dataset is organized in order to properly split the dataset so that all classes are represented. Reminder: 100x400 images in training, 100x100 images in validation.

As can be seen, the dataset is pretty disorganized, so it is necessary to populate the train and validation sets carefully so that each class is represented 400 times in training, and 100 times in validation.

This is a program to create the proper sets.
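
The core of that program is a per-class split: collect the indices of each class, shuffle them, and take the first 400 for training and the next 100 for validation. A sketch of that logic (the function name and the tiny 6/2 demo split are mine, not from the real code):

```python
import numpy as np

def stratified_split(images, labels, n_train=400, n_val=100, seed=0):
    """Put n_train random samples of each class in train and n_val in validation."""
    rng = np.random.default_rng(seed)
    train_idx, val_idx = [], []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)   # all positions of class c
        rng.shuffle(idx)
        train_idx.extend(idx[:n_train])
        val_idx.extend(idx[n_train:n_train + n_val])
    return (images[train_idx], labels[train_idx],
            images[val_idx], labels[val_idx])

# Tiny demo: 3 classes with 10 samples each, split 6/2 per class
labels = np.repeat([0, 1, 2], 10)
images = np.arange(30).reshape(30, 1)
x_tr, y_tr, x_va, y_va = stratified_split(images, labels, n_train=6, n_val=2)
print(len(y_tr), len(y_va))  # 18 6
```

With n_train=400 and n_val=100 over 100 classes, this yields exactly the 40000/10000 distribution described above.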

Above, you can see that I have made some choices about which classes to keep, and I have also merged some classes together. These are based on considerations related to the feasibility of distinguishing between similar classes (e.g. small vs. medium sized mammals). I'm not sure whether large carnivores should be merged with large omnivores/herbivores, but I have decided to leave them separate.

Some classes contain more images than others, so that is something to watch out for.

I aim to construct a full convolutional neural network architecture and then tune it based on the results.

Reminder: the classes are:

1) aquatic animals (classes 1-10)
3) flowers (classes 11-15)
4) food containers (classes 16-20)
5) fruit and vegetables (classes 21-25)
6) household electrical devices (classes 26-30)
7) household furniture (classes 31-35)
8) insects (classes 36-40)
9) large carnivores (classes 41-45)
10) large man-made outdoor things (classes 46-50) NOT INCLUDED
11) large natural outdoor scenes (classes 51-55) NOT INCLUDED
12) large omnivores and herbivores (classes 56-60)
13) small to medium sized mammals (classes 61-65, 81-85)
14) non-insect invertebrates (classes 66-70)
15) people (classes 71-75)
16) reptiles (classes 76-80)
17) trees (classes 86-90)
18) vehicles (classes 91-100)

Sources:
https://www.researchgate.net/publication/365130408_Optimal_Design_of_Convolutional_Neural_Network_Architectures_Using_Teaching-Learning-Based_Optimization_for_Image_Classification
https://medium.com/analytics-vidhya/how-relu-works-f317a947bdc6

Convolutional layer

The convolutional layer's job is to detect low-level features in the image data. In our specific case, these low-level features are important because, for instance, they can help to distinguish between household furniture (which are characterized by straight lines and right angles) and reptiles (characterized by curves).

The features can be very subtle in some cases, e.g. reptiles vs. invertebrates, so I think I will use a small kernel (3x3). This is also done to ensure that training doesn't take too long. The stride will be 1x1 to ensure that I don't miss anything.

For the activation function, I will use the ReLU function. This is because I don't think that the vanishing gradient problem is particularly problematic for this task, since the types of predictions made are quite high-level anyway. Also, ReLU should allow the parameters to converge comparatively quickly.

After the conv layer there will be pooling. Max-pooling is what I intend to use (because that way only the most important features are kept). The dimensionality of the output will be 16x16 so a 2x2 kernel will be used.

It is common in CNNs to have multiple conv layers (with multiple pooling layers as well). For this reason (and for reasons related to the data that I have observed) I will add another convolutional layer. Since only the "type of object" is being predicted, there need not be too many features; it's better to lower the number of features to prevent overfitting.

So, another convolutional layer with a 3x3 kernel and 1x1 stride, and we activate with the logistic function this time. This is to avoid too sharp a dichotomy: it leaves the possibility for features to be "half detected". Another max-pooling follows afterwards.

Now come the fully connected layers. There will be three fully connected layers in the neural network.
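
A Keras sketch of the architecture described so far. The filter counts (10/5) and the widths of the first two dense layers are my assumptions for illustration, not tuned values:

```python
from tensorflow import keras
from tensorflow.keras import layers

# conv 3x3 ReLU -> 2x2 max-pool -> conv 3x3 sigmoid -> 2x2 max-pool -> 3 dense layers
model = keras.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(10, (3, 3), strides=1, padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),   # 32x32 -> 16x16
    layers.Conv2D(5, (3, 3), strides=1, padding="same", activation="sigmoid"),
    layers.MaxPooling2D((2, 2)),   # 16x16 -> 8x8
    layers.Flatten(),
    layers.Dense(32, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(15, activation="softmax"),   # one output per kept class
])
model.compile(optimizer="sgd", loss="categorical_crossentropy",
              metrics=["accuracy"])
print(model.output_shape)  # (None, 15)
```

Training would then be a call like `model.fit(x_train, y_train, validation_data=(x_val, y_val))`.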

Slight issue: the best performance is 18.4%

We need to improve the neural network. One thing that needs to happen is tuning the learning rate, and the architecture needs to change as well.

It can be seen that when mu (the learning rate) is equal to 0.5, the model achieves the best accuracy. Notice how at higher learning rates the model degrades to a dummy classifier (predicting the most frequent class), and at even higher learning rates it appears to lock onto some infrequent category (for whatever reason). An accuracy of 0.21 is the best we have yet, which is still pretty bad, but it can be improved through other means.

New best validation accuracy: 22.7%. It's a step in the right direction to replace "sparse_categorical_crossentropy" with "categorical_crossentropy" to ensure that the relative probabilities get taken into account, because objects can fall into "gray areas" and there are multiple different types that can look similar to each other. A goal is to get the model to predict the right "even higher-level" type of an object, and so this is better.

On a related note, I wonder whether there is a loss function that takes into account differing levels of similarity between "object types".

Quite frankly, no matter how I change the parameters, the accuracy is depressingly bad.

I have come to the conclusion that I have to examine the underlying architecture of the model and make changes according to the dataset. EDA is hard to do with this type of data, but I nevertheless found that:

We therefore need a convolutional step capable of detecting the rather sharp identifying features but not the noise. We need a sort of dichotomy in our feature extraction.

This could be achieved using, for instance, the sigmoid function. But this function is not suited for classification, so let's use softmax for activation in Conv2D.

It can be seen that each category has 50 or more images that contain a 2x2 white area. While this is not a perfect measure of whether or not the image has a white background, it is good enough for our purposes. If we took 50 per class we would have a dataset of 750 images which is far less than the 45000 we started with, but it's not nothing, and we could try fitting a CNN to it to see what would happen.
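
The "2x2 white area" check can be sketched like this; the near-white threshold of 250 is an assumption to experiment with, not a value from the original code:

```python
import numpy as np

def has_white_patch(img, thresh=250):
    """True if the image contains a 2x2 block of near-white pixels (all channels)."""
    white = (img >= thresh).all(axis=-1)   # per-pixel "near-white" mask
    # a 2x2 block is white if a pixel and its right, lower, and diagonal
    # neighbours are all white
    block = white[:-1, :-1] & white[1:, :-1] & white[:-1, 1:] & white[1:, 1:]
    return bool(block.any())

img = np.zeros((32, 32, 3), dtype=np.uint8)
img[4:6, 4:6] = 255   # plant a single 2x2 white patch
print(has_white_patch(img))  # True
```

Filtering the dataset is then just a list comprehension over the images with this predicate.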

The reasoning is that the CNN is picking up on far too much background noise (that's my hypothesis), so it would be nice to test this out.

I will recollect my thoughts here.

I was able to make some progress with training a CNN on white-background-only images. My reasoning was that the background noise would not be detected, so a simple ReLU could eliminate all the background, and then the model could focus more on extracting actual features.

But its best training accuracy is only 22.8%, which is still very low. It might be useful to look at what the CNN is thinking (visualize a feature map).

I've noticed that a lot of the time the model fails to get past some distinct boundary (something like 20%) so I wonder if it is converging towards a local, but not global, minimum. The categorical cross entropy loss function is based on logarithms of the probabilities, so it penalizes the model for getting the right answer but not being confident about it. Nevertheless, the penalty is even more harsh when we apply a softmax transformation to the output vector before categorical cross-entropy. In my view, this should force the model to learn more because often, it is completely indecisive, as seen just above.

Overall, the loss function that I used seems rather unsatisfactory and doesn't seem to have forced the model into learning as much. Let's define our own loss. It is based on the "softmax loss" that comes from one of the sources I cited in my process journal.
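
A NumPy sketch of that idea: apply softmax to the raw outputs, then take the categorical cross-entropy. (In Keras terms this corresponds to computing cross-entropy "from logits"; the function name here is mine.)

```python
import numpy as np

def softmax_cross_entropy(y_true, y_pred, eps=1e-9):
    """Softmax over the raw output vector, then categorical cross-entropy."""
    z = y_pred - y_pred.max(axis=1, keepdims=True)        # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)  # softmax
    return -np.mean(np.sum(y_true * np.log(p + eps), axis=1))

y_true = np.array([[0.0, 1.0, 0.0]])
confident = np.array([[0.0, 5.0, 0.0]])    # strongly favours the right class
indecisive = np.array([[1.0, 1.0, 1.0]])   # completely undecided
print(softmax_cross_entropy(y_true, confident)
      < softmax_cross_entropy(y_true, indecisive))  # True
```

The indecisive output is penalized with a loss of ln(3) ≈ 1.1, while the confident one costs only about 0.01, which is exactly the pressure towards decisiveness described above.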

As it would turn out, the solution to our problem is surprisingly simple: use the Adam optimizer!

The difference between SGD and Adam is that Adam dynamically computes learning rates as the training goes on. I decided to give it a go just after reading about it, because of my suspicion about what was happening with SGD: a single fixed learning rate probably can't serve every level of feature specificity (at a high level, rigid lines probably mean furniture/electronics, whereas lots of spikes could mean quite a few things). And suddenly, after 20 epochs, a 53% accuracy on the training set was achieved!

But for some reason its validation accuracy is 25%. Overfitting, perhaps?

The model is overfitting.

We can see that the validation accuracy is consistently hovering in the low 20s, while the training accuracy is increasing almost to 50%. I wonder why this is. Potentially, it's the extra dense layer.

This problem is even more serious than I thought... it was able to achieve a 100% accuracy on the training set but only 30% on validation...

I think the main problem is that there are way too many learnable weights in the dense layer. Some downsampling is needed.

It's a step in the right direction. But it's still pretty serious overfitting. I'm not really sure why it's doing that, especially given that I shuffled the training and validation data.

Let's try adding a small dense layer before the first one.

It looks like a step in the right direction. The addition of a small dense layer before the output layer has reduced the number of learnable parameters, which tends to lead to a reduction in overfitting as well. But now it's failing to fit altogether.

It is worth thinking about the fully-connected-layer architecture at this point. The ReLU + softmax + max-pool + softmax + max-pool architecture of the convolutional/pooling part appears to have succeeded, with relatively minimal parameters. But the dense layers also form a crucial part of the CNN architecture.

I have redefined the "large" dataset of 45000 samples to achieve class balance. With class imbalance I trained on the most recent setup and achieved 20%. With class balance I achieved 38% accuracy. This is a much better result suggesting that the reason behind most of our previous struggles was probably class imbalance. But it's still not great because it looks like the neural network is struggling to get past 38% accuracy-wise, although surprisingly enough this accuracy is mirrored in the validation set and it's already higher than anything we've had with the white images.

It's probably a signal that the low number of images in the white dataset has made it easier for the neural network to more completely memorize the training set, which is obviously not feasible with 30000 samples. But notice also that the neural network appears to have converged to a local minimum with the large dataset, which might mean it's struggling to get rid of the background noise. Therefore I will stay with the white dataset for a bit longer.

Regarding the overfitting I am considering tuning the batch size as a hyperparameter. Perhaps the low batch size made it easier for the network to overfit.

There are drastic amounts of overfitting going on with batch sizes of less than or equal to 16. It looks like a batch size of 32 achieves the least amount of overfitting while actually fitting to the training set, and this is what many sources will state. But it's nice to see it in practice.

Date: 20 November

I have just realized that potentially it is the high number of epochs that is causing the overfitting. 50 cycles of the training set is way too many and it is enough to allow the model to memorize the training set. Trying with only 25 epochs instead, but with the same setup.

Less overfitting this time. So now the main task is to improve the accuracy, and this can be done by reworking the architecture of the neural network.

The convolutional + max-pooling part already consists of multiple layers. But now it's the dense part that should be improved. The dense layers, particularly the hidden ones, play an important role in using the features extracted by the convolutional layers to make deductions. One layer is probably not good enough, so I will place another layer before it. It is important that this layer not have too few or too many neurons; 20 seems like a good number.

https://keras.io/api/layers/initializers/

It got a lot worse.

Quick update: I added initializers and biases to every layer. After this change, when I use 1 dense layer it gets 40% train vs. 30% validation - on par with before.

Adding an additional dense layer brings it down to 13% vs. 13% which is horrendous. Perhaps I should be using ReLU in the first dense layer.

Improvement.

(I turned on "verbose" so I could see the progressive learning better.)

The accuracy climbs to 26% vs. 24% with this neural network. Apologies for the lack of visual communication.

It's probably for similar reasons to using ReLU in the first convolution.

What if we try similar things with the dense layers to what we did for the convolution/max-pooling? Namely, a hidden layer with softmax, with something like ~35 neurons?

So, I had a misconception. I forgot that softmax normalizes outputs to form a probability distribution. But then I realized that it amplifies higher values even further, so it might amplify parts that are actually in the image and not those outside.

Even though I added an extra conv+maxpool to reduce the size of the input to the fully connected part from 192 to 48, overfitting (and also a lack of learning) still occurs, which puzzles me.

I wonder whether there are other activation functions that don't normalize to the range [0, 1]. I found nothing, so I changed all activation functions to ReLU except for the last layer.

It still overfits for whatever reason.

My preconceptions were challenged (yet again!) when I discovered that the small dataset, which I thought could improve the model's performance, could actually be the cause of overfitting. Decreasing the learnable parameters may have helped to a certain extent, but ultimately the small dataset makes it easier for the model to memorize the training set. Also, as it turns out, the softmax function anywhere other than the last layer (where I only chose it because it was necessary) suffers from a "vanishing gradient" problem. Because the Adam optimizer is an adaptation of stochastic gradient descent, this could possibly be why the model failed to learn the large set.

The training of my new model on the large dataset has just concluded. It again appears to have stagnated, even lower than last time. I will stick with this large dataset in spite of my considerations about background noise, because ReLU should be able to account for that. More on this issue down below.

Overfitting has most definitely subsided in its effect. We can see that validation accuracy very closely mirrors training accuracy. So the problem can be reduced to simply fitting the model to the training set.

Let's do a bit of EDA to investigate the "background noise" issue.

Thoughts:

We can see that there is a strict "dichotomy" between background and object. This was very obvious and intuitive in the case of the small dataset, but it applies here too, albeit more subtly. A good ReLU activation should be able to capture this dichotomy well enough for us to use the large dataset. Training might take a lot longer, which is why I should really think through each change I make to the architecture from now on.

The amount of blur in the background is also quite a lot higher than in the object itself (although of course there is some blur inside the object as well). The image appears to be "sharper" inside the object than outside it. This is a subjective measure to a human, but we can make the machine learning model "learn" what a blur is. It might just require us to enlarge the kernels of some of the convolutional layers slightly. I do think a ReLU is called for, though. The first layer's kernel will be enlarged to 5x5 (an odd size is still necessary for symmetry purposes).

Overfitting should not occur, because the dataset is quite large, and going from 27 to 75 learnable parameters within the first convolutional layer is not much of a change, especially when considering the 5x48 + 15x5 = 315 learnable parameters in the fully connected layers.
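
The arithmetic behind those counts can be checked quickly (per-filter conv weights are kernel width x kernel height x input channels; the dense figure is taken from the text):

```python
# Per-filter learnable weights in a conv layer over an RGB input
k3 = 3 * 3 * 3            # 27 weights for a 3x3 kernel
k5 = 5 * 5 * 3            # 75 weights for the enlarged 5x5 kernel
dense = 5 * 48 + 15 * 5   # 315 weights in the fully connected layers
print(k3, k5, dense)  # 27 75 315
```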

I keyboard-interrupted the training process after only five epochs because it didn't seem to be making much progress. If anything its accuracy is slightly lower than when I set the kernel size to 3x3x3 rather than 5x5x3. It might be that it's taking longer to converge because there are more parameters involved.

But I think its low accuracy for all this time is simply because the convolutional architecture is limited and not enough to properly extract features. It is possible to remove the background noise and learn stuff about the object itself; we just need more layers.

Come to think of it, increasing the kernel size shouldn't have much of an effect and just increases the training time pointlessly. Instead I should probably add more convolutional layers that use the ReLU activation. Trustworthy sources from Towards Data Science state that the ReLU allows a neural network to create "folds" around the training set. This sounds like overfitting but done in moderation it (hopefully) shouldn't be a problem.

It could be that the three max-pools are too many and remove crucial parts of the image. I will reduce max-pools to two but put the network through two convolutional layers before each one.

Further, stressing the importance of the dense layers I will increase the number of neurons in the first dense layer to 16 because 5 neurons can cause information to be abstracted away.

Conducting an adequate scientific investigation into the effects of the changes described above will require controlling some of them and changing one thing at a time to understand what's really going on. Let me change the dense layer first.

It helps.

I do think that increasing the number of neurons in the first dense layer is a necessary step to take, because five is way too few. 16 seems like a good number.

Let's experiment now with the idea of removing one max-pool and adding additional conv layers.

It helps more.

I have the idea of adding more neurons to the dense layer, and more filters to the convolutional layers in order to improve the feature extraction.

The results from adding more filters are promising, indicating that it was a problem with feature extraction. This makes a lot of sense, because there were already many neurons in the dense layers. (It's not shown, but I trained another model with only 5 filters but 64 neurons in the first fully-connected layer, and it performed considerably worse.) My logic was that more filters are needed to capture lower-level features. You don't need as many for higher-level ones, although perhaps I should add more. In fact, let's try some of that.

The results from adding more filters to the convolutional layers are very promising indeed! We broke the 40% barrier for the very first time on the validation set. I don't think this change should cause significant overfitting because 1) the dataset is large enough that feasibly training a model to fit to it within a low number of epochs is not doable, and 2) the kernels are only 3x3x3. Adding 22 more kernels (in the case of the first layer) barely adds any learnable parameters compared to the thousands of parameters added to the dense layers. Things are looking up!

UPD: Some small amounts of overfitting were detected. I suppose that reducing the number of kernels in convolutional layers to 24/12 instead of 32/16 could partially mitigate this issue, as it was not present with a 10/5 split.

Tomorrow there will be a meeting with an AI expert (tomorrow being 23 November). Things to discuss:

Insist on the fact that this project is a mechanism to learn about neural networks in general.

UPD2: Clearly there is significant overfitting happening which has to be directly caused by the additional kernels.

New best validation accuracy: 44.5%

Interview:

Works at the International Computing Center for the United Nations. Name: Adrian Errea

Notes:

Elaboration on "error analysis": choose some samples, run the model, see what type of sample it struggles with the most, and factor that into decision-making

Training on smaller datasets should not cause too much overfitting, and it allows for more epochs and deeper networks

The CNN that I am training has memorized the entire training set by heart, which I find very funny but is also bad news for our project. Lowering the number of filters doesn't help.

The fact that the CNN learnt the entire set by heart is not necessarily indicative of its ability to learn features that it can sustain over time, but rather it is just memorizing data. This is extremely severe overfitting and it needs to be fixed.

What samples does it struggle with?

Observations.

I ran the above experiment about 20-30 times (granted, it was quite fun) and I made some very useful observations about the model.

Plan: merge "carnivores", "omni/herbivores", and "small to medium mammals" into one class (called "land animals"), which means total 13 classes. Now take 300 random samples from each class to form train + validation sets. 80-20 split means 3120 training samples and 780 validation samples.

Overfitting still occurs. But at least now we are using a dataset with more general images and the three similar classes of "carnivores", "omni/herbivores" and "small to medium sized mammals" are now condensed into one class.

Let's try to solve the overfitting problem using dropout. This way, we ensure that the model fully learns features rather than noise.

Source: https://towardsdatascience.com/dropout-in-neural-networks-47a162d621d9, https://machinelearningmastery.com/dropout-regularization-deep-learning-models-keras/
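
The mechanism itself is simple enough to sketch in NumPy. This is "inverted" dropout as commonly described (and as Keras implements it internally): a random fraction of activations is zeroed, and the survivors are scaled up so the expected activation stays the same:

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout: zero a random fraction, scale the rest to keep the mean."""
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

rng = np.random.default_rng(0)
a = np.ones(10000)
out = dropout(a, 0.5, rng)
print(out.mean())  # stays close to 1.0: the expected activation is preserved
```

In Keras this is just a `layers.Dropout(0.5)` inserted between the dense layers, active only during training.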

It seems that no matter what I change, at some point the validation loss starts getting reverse-optimized - or, more formally, it starts increasing after some period of decrease.

This is an indicator of overfitting, but it's just as much an indicator that it's probably an underlying problem with the data that is causing this issue. I'll run the investigation again.

Today is Friday, 24th of November. The goal is to reduce overfitting. Previous attempts to use dropout didn't work too well, so let's try different approaches.

Firstly, I'm not even sure whether "shuffle" will shuffle the entire set, or split 80-20 and then shuffle. To be completely sure...
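
To remove any doubt, the whole set can be shuffled manually before taking the 80-20 split, rather than relying on a framework's "shuffle" option. A small self-contained sketch:

```python
import numpy as np

# Shuffle all indices first, then cut 80-20, so both sets draw from the whole set
rng = np.random.default_rng(42)
n = 100
perm = rng.permutation(n)
split = int(0.8 * n)
train_idx, val_idx = perm[:split], perm[split:]
print(len(train_idx), len(val_idx))  # 80 20
```

Indexing the image and label arrays with `train_idx` and `val_idx` then yields properly mixed sets regardless of how the data was ordered on disk.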

We can see that the overfitting is somewhat mitigated when we increase the batch size. But it still manages 65% accuracy on training set.

Given the small sizes of the images, the advice I was given to train on a smaller dataset can be "stretched". I would say, doubling the sizes of the datasets should rather positively affect performance. That is, 6240 training and 1560 validation. I have genuinely had it with the overfitting. I don't feel that reducing the complexity of the model can positively affect results because feature extraction is already not very strong as it is. I can add more layers but with this small a set it would probably just overfit even more. So I will double dataset size. But I do agree that training on 37500 samples is a very bad idea.

One observation I just made is that there are some grayscale images which could unjustly sway the opinion of the CNN. How prevalent are these images actually?
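
Grayscale images stored as RGB have identical values in all three channels, so they can be counted with a simple check (the function name is mine):

```python
import numpy as np

def is_grayscale(img):
    """True if all three channels are identical for every pixel."""
    return bool((img[..., 0] == img[..., 1]).all()
                and (img[..., 1] == img[..., 2]).all())

gray = np.repeat(np.arange(16, dtype=np.uint8).reshape(4, 4, 1), 3, axis=2)
color = gray.copy()
color[0, 0, 0] += 1   # perturb a single red value
print(is_grayscale(gray), is_grayscale(color))  # True False
```

Summing this predicate over the dataset answers the prevalence question directly.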

There aren't that many grayscale images, but it does little harm to remove them from the dataset, because the dataset size of 6240 train / 1560 validation is small enough that these grayscale images could contribute to the overfitting.

While we're at it, let's also remove those images that have a white background. Our goal is to keep the images that were taken with a lot of background noise. Theoretically that is a bad thing, but since there are so few images without background noise (so few that there would be overfitting unless we severely reduced the model's capacity, in which case there would be a lack of learning), it should be the right thing to do.

It didn't help. (Why should it have?)

https://www.dataquest.io/blog/regularization-in-machine-learning/

A weight-decay regularization approach may be necessary to ensure that the model learns properly (and not just on the training set).

Further investigation showed that the weights aren't even that high (or maybe they are and I just don't know what "high" means). Regardless, I just got the idea of reverting to the standard stochastic gradient descent optimizer and tuning its learning rate as a hyperparameter. And, obviously, reducing the filters in the earlier layers. The reasoning is that Adam adapts its learning rate, which could lead it to adapt it for the purpose of memorizing the training set.

It was a grave misconception.

Date: 26 November

I found this publicly available research paper at https://www.researchgate.net/publication/331677125_An_Overview_of_Overfitting_and_its_Solutions about overfitting. I would like to try the L2 regularization with my neural network, as opposed to L1, because this way features are still learned albeit with smaller weights.

Used documentation at https://keras.io/api/layers/regularizers/

why does it keep overfitting

Observations, round 2

There are certain pairs of classes that closely relate to each other, e.g. fruits and vegetables vs. flowers, and it probably should be highlighted to the model that these two classes are different. Something similar was previously happening with land animals, but it was always going to be hard for the model to distinguish between omnivores and carnivores (how do you even do that?). Fruits and vegetables vs. flowers, however, is not particularly hard to distinguish. The same goes for furniture vs. electronics. I will therefore experiment with the idea of adding additional penalty terms to the loss function for these common "50/50" choices. (Trees vs. insects is another 50/50 choice, although it's less pronounced.)
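
One way to sketch such a penalty term: scale the cross-entropy up whenever the model confuses a known 50/50 pair. The class indices and the 1.2 factor below are placeholders for illustration, not the real mapping:

```python
import numpy as np

# Hypothetical (true class, predicted class) pairs that get an extra penalty
PAIRS = {(16, 2): 1.2,   # e.g. a tree image predicted as "insects"
         (4, 1): 1.2}    # e.g. fruit/veg predicted as "flowers"

def weighted_cross_entropy(true_idx, p, eps=1e-9):
    """Cross-entropy, scaled up when the model hits a known 50/50 confusion."""
    loss = -np.log(p[true_idx] + eps)
    pred = int(np.argmax(p))
    return loss * PAIRS.get((true_idx, pred), 1.0)

p = np.zeros(20)
p[2], p[16] = 0.9, 0.1   # the model says "insects" on a tree image
print(weighted_cross_entropy(16, p) > -np.log(0.1 + 1e-9))  # True: 1.2x applied
```

The same idea carries over to a custom Keras loss by building a penalty matrix indexed by (true, predicted) classes.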

It is important to note that for some of these 50/50s, the model only gets one direction wrong. For instance, with trees vs. insects: the model almost never predicts trees on insect images, but the reverse is very common. Specifications:

Observations, round 3

The main observation I have been able to make is that the model fails to detect very subtle images. Also there are a lot of "reptile" images of sea turtles and sea snakes and so on that could very easily be classified as sea animals (because of the blue background). Perhaps I should add this as a "50/50" pair for the loss function.

Indeed, adding a 1.2 scale factor to the loss function for the aforementioned 50/50 pairs has improved the accuracy slightly. Let's try more of that.

It seems as though the long-overdue EDA is finally coming now, with randomized analysis ;)

Bruh I've been typing it in wrong the whole time ;) I genuinely don't know how it managed to improve.

We'll just have to wait and see I guess...

Combating the issue requires more regularization. Also I will add a hidden layer.

In effect, the problem is reduced to solving the overfitting issue. I already have a model that can reach 77% accuracy on the training set in 20 epochs, now we just need to make it learn on validation as well.

Let's run another regularization grid search. 0.001, 0.004, 0.006, 0.008, 0.01.

29 November

UPD: I found out that L1 regularization may be better for this project.

https://neptune.ai/blog/fighting-overfitting-with-l1-or-l2-regularization

L1 is "more robust to outliers", which, judging by my "random EDA" (which consists of taking random validation images and looking at the model's thought process on them), seem to be very common. Rockets, for instance, could be considered outliers within the set of vehicles, because they don't resemble other vehicles such as cars and trains. It makes sense from a mathematical point of view that L1 regularization would work better on outlier-heavy datasets, because L2 regularization "inflates" the outlier weights through the squaring. You know what, let's try a bit of L1 regularization in this project. It could be that the model's failure to learn is due to our use of L2 regularization, because it focuses on outliers, and in particular those in the training set. It could also be why overfitting is still witnessed in spite of regularization.
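
The squaring effect is easy to see numerically. With one "outlier" weight among small ones, the L2 penalty is dominated by that single weight, while L1 treats it proportionally:

```python
import numpy as np

w = np.array([0.1, 0.1, 3.0])   # two small weights and one outlier
l1 = np.abs(w).sum()            # 3.2: the outlier contributes proportionally
l2 = (w ** 2).sum()             # 9.02: squaring makes the outlier dominate
print(l1, l2)
```

So under L2, nearly all of the regularization pressure lands on the outlier weight, which matches the reasoning above.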

30 November

Today is the last day of November. I performed some more EDA and decided that reptile images with blue background should be removed as they are virtually indistinguishable from sea animals. No, genuinely.

Also, the size of the dataset will be reduced once again to 2600+650, because then less complexity is needed which 1) reduces training time and 2) reduces overfitting.

Detecting a blue background in reptile images is kind of tough. I'll experiment with different methods.
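
One heuristic to try: look only at the border pixels and count how many are blue-dominant. The margin of 20 and the "border only" choice are assumptions to tune, not settled values:

```python
import numpy as np

def blue_background_fraction(img):
    """Fraction of border pixels where blue clearly dominates red and green."""
    border = np.concatenate([img[0], img[-1], img[:, 0], img[:, -1]])
    blue = (border[:, 2] > border[:, 0] + 20) & (border[:, 2] > border[:, 1] + 20)
    return blue.mean()

# int dtype avoids uint8 overflow in the +20 comparisons
sea = np.zeros((32, 32, 3), dtype=np.int32)
sea[..., 2] = 200   # a solid blue image
print(blue_background_fraction(sea))  # 1.0
```

A reptile image could then be flagged for removal when this fraction exceeds some threshold, say 0.5.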

Alright, then.

https://cs231n.github.io/neural-networks-2/#init

Ever initialize the biases to 0.01?

http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf

https://pythonbasics.org/flask-rest-api/

https://medium.com/geekculture/how-does-batch-size-impact-your-model-learning-2dd34d9fb1fa

The batch size is another hyperparameter that could be tuned. According to the article above, decreased batch sizes often result in better generalization. It's not too difficult to see why, since a lower batch size means that the network doesn't need to compute the gradient as accurately.

Let's try a network with no dense regularization but a batch size of 16 instead of 32.

4 December

EDA, round 4.

Note: I know that today is supposed to be a "guiding deadline" to finish the product. But in light of the advanced nature of this project, I find it permissible to keep working on this until perhaps as late as next Wednesday night. The point is to get this thing to as high of a level as possible.

What I find is that the model struggles to cast out irrelevant details and focus on the "main object". For instance, if one shows the model an image of an orange placed on a table, clearly the orange will be the main focus of the image. But the model instead predicts "furniture", which isn't wrong but is just missing the point. In other cases, an image of a table will be presented without anything else, so clearly the focus of the image is on the table and the model should predict "furniture".

So overall the model struggles with focus. Perhaps this is why overfitting is perceived and also why it can't be fixed with traditional methods such as regularization. Addressing the overfitting issue requires addressing far more fundamental problems with the data. No samples need to be removed; the model just needs to learn focus. And many of the images in the dataset are weird in this regard. You could show it a fishbowl with a fish inside it and it would say "food containers" instead of "sea animals".

As for how I would address this issue, regularizing the convolutions would be my immediate thought. But it hasn't worked regardless of how many times I try it. This is something that I intend to discuss tomorrow with Mr. Errea during our second meeting.